Extracting key frames from a video sequence

ABSTRACT

A method of extracting key frames from a video sequence, which video sequence comprises MPEG compressed video data having block motion vectors. The method initially partially decompresses (202) the MPEG compressed video data to obtain block motion vectors and converts (204) the block motion vectors to forward block motion vectors. The method then generates (206) global motion signals and generates (306) dominant global direction clusters. The method then selects (402, 404, 406) potential key frames of the video sequence using the generated dominant global direction clusters. The method lastly decompresses (408) the selected key frames to obtain the extracted key frames.

TECHNICAL FIELD OF THE INVENTION

The present invention relates generally to extracting key frames from a video sequence. In particular, the present invention relates to a method and apparatus for extracting key frames from a video sequence and to a method and apparatus for generating a representation of a video sequence. The invention also relates to a computer readable medium comprising a computer program for implementing such methods.

BACKGROUND ART

Video cameras have become increasingly popular in recent times. It is a common occurrence for camera users to store one or more video clips on each videocassette or other medium. With the proliferation of video data, there has thus arisen a need for users to organise and manage their video data.

One rudimentary method for organising and managing the video data involves keyword-based searches and fast forward/backward browsing to access the specific portions of a video. However, keyword-based data retrieval systems cannot precisely and uniquely represent video data content, and fast forward/backward operations are extremely slow and inefficient.

Another popular method for accessing specific portions of video clips uses key frames as representative frames extracted from a video sequence. Key frame extraction permits fast video browsing and also provides a powerful tool for video content summarisation and visualisation.

However, video summarisation and visualisation based on the extraction of frames at regular time instances exploits neither shot information nor frame similarity. Short but important shots may have no representative frame, while long shots may be represented by multiple frames with similar content.

Another popular method for producing video summaries is to use cut/change detection to select representative key frames for shots in a movie. A typical approach to selecting representatives is to use the cut-points as key frames. The key frames are then used as the summary. Typically, the cut-points are determined from colour histograms of the frames. A cut-point is determined when the difference between colour histograms of adjacent frames is greater than a predetermined threshold. However, this method sometimes generates too many key frames, and in many cases (e.g. movies, news, reports, etc.), the selected key frames can contain many similar frames (e.g. of the newsreader).

These histogram techniques are pixel based or block based. Thresholding methods are then employed to determine scene changes. These techniques often produce erroneous results because changes in lighting can cause a shift in colour between successive frames that depict the same scene. Similarly, a camera zoom shot often produces too many key frames.

U.S. Pat. No. 5,995,095 by Ratakonda et al describes a method of hierarchical digital video summarisation and browsing which includes inputting a digital video signal for a digital video sequence and generating a hierarchical summary based on key frames of the video sequence. The hierarchical summary contains multiple levels, where levels vary in terms of detail (i.e. the number of frames). The coarsest, or most compact, level provides the most salient features and contains the fewest frames.

The user may be presented with the most compact (coarsest) level summary. The user may then tag a parent frame and see the child frame(s) in the finer level. Tagging frames in the finest level results in playback of the video. The method selects the key frames for inclusion in the finest level of the hierarchy by utilising shot boundary detection. Shot boundary detection is performed using a threshold method, where differences between histograms of successive frames are compared to determine shot boundaries (i.e. scene changes). The hierarchical video summarisation method can be performed on MPEG compressed video with minimal decoding of the bitstream. The video summarisation method can optionally and separately determine an image mosaic of any pan motion and a zoom summary of any zoom. However, Ratakonda et al discloses that to incorporate the automatic pan/zoom detect/extract functionality the entire frame bitstream needs to be decoded. Moreover, Ratakonda et al discloses pan and zoom detection methods based on pixel-level motion vectors, which are computationally expensive and inefficient. In addition, Ratakonda et al describes constructing an image mosaic of a panoramic view of the shot frames, which cannot be effectively implemented in complex real world shots, where background/foreground changes or complicated camera effects may appear.

SUMMARY OF THE INVENTION

It is an object of the present invention to substantially overcome, or at least ameliorate, one or more disadvantages of existing arrangements.

According to one aspect of the invention, there is provided a method of extracting key frames from a video sequence, wherein the video sequence comprises compressed video data having motion vectors; the method comprising the steps of: generating global motion signals based on the motion vectors; generating dominant global direction clusters based on said generated global motion signals; selecting key frames using said generated dominant global direction clusters; and decompressing said selected key frames to obtain said extracted key frames.

According to another aspect of the invention, there is provided a method of generating a representation of a video sequence, wherein said video sequence comprises compressed video data having block motion vectors, the method comprising the steps of: decompressing the compressed video data to obtain said block motion vectors; converting said block motion vectors to forward block motion vectors; generating global motion signals based on the forward block motion vectors; generating dominant global direction clusters based on said generated global motion signals; selecting potential key frames of the video sequence using said generated dominant global direction clusters and a set of predefined rules; removing redundant key frames from said selected potential key frames resulting in remaining selected key frames; and decompressing said remaining selected key frames to obtain said representation of the video sequence.

According to another aspect of the invention, there is provided a method of extracting key frames from one or more video clips, wherein each said video clip comprises MPEG compressed video data having block motion vectors, the method comprising the steps of: partially decompressing the MPEG compressed video data to obtain said block motion vectors; converting said block motion vectors to forward block motion vectors; generating a pan global motion signal, a zoom global motion signal, and a tilt global motion signal based on the forward block motion vectors; generating dominant global direction clusters based on said pan, tilt, and zoom generated global motion signals, wherein said dominant global direction clusters comprise one or more of a pan left, pan right, tilt up, tilt down, zoom in, zoom out and global still motion cluster; selecting potential key frames of each said video clip using said generated dominant global direction clusters and a set of predefined rules; removing redundant key frames from said selected potential key frames using a predefined set of heuristic rules resulting in a first set of remaining selected key frames; removing similar and/or repeated key frames from said first set of remaining selected key frames using a colour histogram technique resulting in a second set of remaining selected key frames; and decompressing said second set of remaining selected key frames to obtain said extracted key frames.

According to another aspect of the invention, there is provided apparatus for extracting key frames from a video sequence, wherein the video sequence comprises compressed video data having motion vectors; the apparatus comprising: means for generating global motion signals based on the motion vectors; means for generating dominant global direction clusters based on said generated global motion signals; means for selecting key frames using said generated dominant global direction clusters; and means for decompressing said selected key frames to obtain said extracted key frames.

According to another aspect of the invention, there is provided apparatus for generating a representation of a video sequence, wherein said video sequence comprises compressed video data having block motion vectors, the apparatus comprising: means for decompressing the compressed video data to obtain said block motion vectors; means for converting said block motion vectors to forward block motion vectors; means for generating global motion signals based on the forward block motion vectors; means for generating dominant global direction clusters based on said generated global motion signals; means for selecting potential key frames of the video sequence using said generated dominant global direction clusters and a set of predefined rules; means for removing redundant key frames from said selected potential key frames resulting in remaining selected key frames; and means for decompressing said remaining selected key frames to obtain said representation of the video sequence.

According to another aspect of the invention, there is provided apparatus for extracting key frames from one or more video clips, wherein each said video clip comprises MPEG compressed video data having block motion vectors, the apparatus comprising: means for partially decompressing the MPEG compressed video data to obtain said block motion vectors; means for converting said block motion vectors to forward block motion vectors; means for generating a pan global motion signal, a zoom global motion signal, and a tilt global motion signal based on the forward block motion vectors; means for generating dominant global direction clusters based on said pan, tilt, and zoom generated global motion signals, wherein said dominant global direction clusters comprise one or more of a pan left, pan right, tilt up, tilt down, zoom in, zoom out and global still motion cluster; means for selecting potential key frames of each said video clip using said generated dominant global direction clusters and a set of predefined rules; means for removing redundant key frames from said selected potential key frames using a predefined set of heuristic rules resulting in a first set of remaining selected key frames; means for removing similar and/or repeated key frames from said first set of remaining selected key frames using a colour histogram technique resulting in a second set of remaining selected key frames; and means for decompressing said second set of remaining selected key frames to obtain said extracted key frames.

According to another aspect of the invention, there is provided a computer readable medium comprising a computer program for extracting key frames from a video sequence, wherein the video sequence comprises compressed video data having motion vectors; the computer program comprising: code for generating global motion signals based on the motion vectors; code for generating dominant global direction clusters based on said generated global motion signals; code for selecting key frames using said generated dominant global direction clusters; and code for decompressing said selected key frames to obtain said extracted key frames.

According to another aspect of the invention, there is provided a computer readable medium comprising a computer program for generating a representation of a video sequence, wherein said video sequence comprises compressed video data having block motion vectors, the computer program comprising: code for decompressing the compressed video data to obtain said block motion vectors; code for converting said block motion vectors to forward block motion vectors; code for generating global motion signals based on the forward block motion vectors; code for generating dominant global direction clusters based on said generated global motion signals; code for selecting potential key frames of the video sequence using said generated dominant global direction clusters and a set of predefined rules; code for removing redundant key frames from said selected potential key frames resulting in remaining selected key frames; and code for decompressing said remaining selected key frames to obtain said representation of the video sequence.

According to another aspect of the invention, there is provided a computer readable medium comprising a computer program for extracting key frames from one or more video clips, wherein each said video clip comprises MPEG compressed video data having block motion vectors, the computer program comprising: code for partially decompressing the MPEG compressed video data to obtain said block motion vectors; code for converting said block motion vectors to forward block motion vectors; code for generating a pan global motion signal, a zoom global motion signal, and a tilt global motion signal based on the forward block motion vectors; code for generating dominant global direction clusters based on said pan, tilt, and zoom generated global motion signals, wherein said dominant global direction clusters comprise one or more of a pan left, pan right, tilt up, tilt down, zoom in, zoom out and global still motion cluster; code for selecting potential key frames of each said video clip using said generated dominant global direction clusters and a set of predefined rules; code for removing redundant key frames from said selected potential key frames using a predefined set of heuristic rules resulting in a first set of remaining selected key frames; code for removing similar and/or repeated key frames from said first set of remaining selected key frames using a colour histogram technique resulting in a second set of remaining selected key frames; and code for decompressing said second set of remaining selected key frames to obtain said extracted key frames.

According to a still further aspect of the invention, there is provided a video summary produced by any one of the methods described above.

BRIEF DESCRIPTION OF THE DRAWINGS

A number of preferred embodiments of the present invention will now be described with reference to the drawings, in which:

FIG. 1 is a flow diagram of an overview of a method of extracting key frames from a video sequence in accordance with a first embodiment;

FIG. 2 is a flow diagram of the sub-steps of step 106 of the method shown in FIG. 1;

FIG. 3 is a flow diagram of the sub-steps of step 108 of the method shown in FIG. 1;

FIG. 4 is a flow diagram of the sub-steps of step 110 of the method shown in FIG. 1;

FIG. 5A is a graph of the (pan) global motion signal x(t) for an exemplary video sequence;

FIG. 5B is a graph of the (tilt) global motion signal y(t) for the same exemplary video sequence used in FIG. 5A;

FIG. 5C is a graph of the (zoom) global motion signal z(t) for the same exemplary video sequence used in FIG. 5A;

FIG. 5D is a graph of the dominant global direction clusters as a function of time, the potential key frames, and the generated key frames of the same exemplary video sequence used in FIG. 5A; and

FIG. 6 is a schematic block diagram of a general-purpose computer upon which the embodiments of the present invention can be practiced.

DETAILED DESCRIPTION INCLUDING BEST MODE

Where reference is made in any one or more of the accompanying drawings to steps and/or features, which have the same reference numerals, those steps and/or features have for the purposes of this description the same function(s) or operation(s), unless the contrary intention appears.

Video camera users often pan and zoom from one location to another to show the connection between different places and events, and hold the camera still to focus on an important event or something of particular interest to them. The key frame extraction method in accordance with the embodiment is based on dominant global direction clusters of camera motion estimated from compressed video data. The method takes advantage of the fact that the incoming video is already in compressed form, so that the computational cost of fully decompressing every frame is avoided. Only a selected number of key frames need be decompressed at the end of the process. The method also attempts to capture the user's interests and some important events. It gives a reasonable number of efficient and effective key frames depending on the video complexity.

The principles of the preferred method described herein have general applicability to a method of extracting key frames from a video sequence. However, for ease of explanation, the steps of the preferred method are described with reference to video clips. A video clip is defined as that section of video between a record-start and a record-end capture event. However, it is not intended that the present invention be limited to the described method. For example, the invention may have application to commercial movies and the like having many such clips. The method is also applicable to motion-compensated predictive compressed video such as MPEG2. However, it is not intended to be limited thereto. Any compressed video sequence incorporating motion vectors would be suitable.

Turning now to FIG. 1, there is shown a flow diagram of an overview of a method of extracting key frames from a video clip in accordance with a first embodiment. The key frame extraction method 100 commences at step 102 where any necessary parameters are initialised. The method 100 continues to step 104 where an MPEG2 compressed video clip is input for processing by the method 100. The method 100 then proceeds to step 106, where it generates global motion signals of the video clip. These global motion signals comprise global motion parameters for most frames of the video clip. These global motion parameters comprise a pan parameter, a tilt parameter, and a zoom parameter for each available frame of the video clip. The global motion signals comprise these parameters as a function of time for the whole video sequence. For example, FIGS. 5A to 5C show graphs of the global motion signals for an exemplary video sequence. The manner in which these global motion signals are generated is described in more detail below.

After step 106, the method proceeds to step 108, where the method generates clusters of the dominant direction of the global motion for the entire video sequence (herein called “dominant global direction clusters”). During this step 108, the method takes as input all three global motion signals for each available frame and determines the dominant direction of the global motion signal for that frame. The dominant direction for a frame can be any one of pan left, pan right, tilt up, tilt down, zoom in, zoom out and global still. The dominant directions of all available frames are then clustered together to form the dominant global direction clusters for the whole video sequence. For example, FIG. 5D is a graph of the dominant global direction clusters as a function of time (viz. frame number) of the same exemplary video sequence as used in FIG. 5A. As can be seen from FIG. 5D, the video sequence is segmented into clusters beginning at frame number 0 with a global still cluster, then a zoom in cluster, a pan left cluster, and lastly a global still cluster. The manner in which these dominant global direction clusters are generated is described in more detail below.

After step 108, the method proceeds to step 110, where key frames are extracted from the video sequence. During this step 110, a number of potential key frames are selected from dominant global direction clusters using a set of predefined rules. The method then removes redundant potential key frames and finally fully decodes the remaining resultant key frames. The manner in which these key frames are extracted is described in more detail below. The method then outputs the decoded key frames as a summary of the video sequence.

Turning now to FIG. 2, there is shown in more detail a flow diagram of the sub-steps of step 106 of the key frame extraction method of FIG. 1. After the MPEG2 video sequence has been input 104, the method proceeds to step 202. During this step 202, the MPEG2 compressed video sequence is partially decompressed in a known manner to obtain all the MPEG2 block motion vectors of the video sequence.

The MPEG2 compression standard for moving images exploits both spatial and temporal redundancy of video sequences. MPEG2 utilises a number of modes of compression. One mode is called intraframe coding, wherein a number of pictures of the video are individually and independently compressed or encoded. Intraframe coding exploits the spatial redundancy that exists between adjacent pixels of a picture. Pictures encoded using only intraframe encoding are called I-pictures. MPEG2 utilises another mode called interframe coding, which exploits the temporal redundancy between pictures. Temporal redundancy results from a high degree of correlation between adjacent pictures.

MPEG2 exploits this redundancy by computing an interframe difference signal called the prediction error. In computing the prediction error, MPEG2 has adopted a macro-block approach for motion compensation. A target macro-block in a frame to be encoded is matched with a most similar displaced macro-block in a previous (or consecutive) frame, called a reference image. A (block) motion vector that describes a displacement from the target macro-block to the prediction macro-block indicates the position of the best matching macro-block, or prediction macro-block. The (block) motion vector information is encoded and transmitted along with compressed image frames. In forward prediction, a target macro-block in the picture to be encoded is matched with a set of displaced macro-blocks of the same size in a past picture called the reference picture. A (block) motion vector that describes the horizontal and vertical displacement from the target macro-block to the prediction macro-block indicates the position of this best matching prediction macro-block. Pictures coded in MPEG2 using forward prediction are called P-pictures.

The MPEG2 compression standard also uses bi-directional temporal prediction. Pictures coded with bi-directional prediction use two reference pictures, one in the past and one in the future. A target macro-block in bi-directionally coded pictures can be predicted by a prediction macro-block from the past reference picture (forward prediction), or one from the future reference picture (backward prediction), or by an average of two prediction macro-blocks, one from each reference picture (interpolation). In every case, a prediction macro-block from a reference picture is associated with a motion vector, so that up to two motion vectors per macro-block may be used with bi-directional prediction.

During this step 202, the key frame extraction method partially decompresses the video sequence to obtain all the MPEG2 block motion vectors of the video sequence. The method does not fully decode the sequence; it does not undertake any interframe decoding during this step. It will be appreciated by a person skilled in the art that not all frames are partially decoded, because not all frames have block motion vectors.

After the method has partially decompressed 202 the MPEG2 video sequence, the method proceeds to step 204. In this step 204, the method converts all the backward (block) motion vectors to forward (block) motion vectors, which basically requires just a change of reference and direction. These forward (block) motion vectors are representative of local displacement vectors from which global motion can be calculated.

The method then continues to step 206, where the method calculates three global motion parameters for each available frame that comprises forward motion vectors. These global motion parameters are calculated from the forward motion vectors of a frame using the method described in “Global Zoom/Pan estimation and Compensation for video Compression” from Proc. ICASSP91 by Yi Tong Tse and Richard Baker, pages 2725 to 2728. Three global parameters are computed: x as the pan parameter, y as the tilt parameter and z as the zoom parameter. The global motion parameters are calculated for each available frame. Three global motion signals, X=x(t), Y=y(t) and Z=z(t), are then formed from these global parameters as a function of time for the video sequence. Examples of such generated global motion signals are shown in FIGS. 5A to 5C.
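
As an illustration of how three global parameters might be obtained from the forward block motion vectors of one frame, the following is a minimal least-squares sketch. It is not the Tse and Baker formulation, which is not reproduced in this description. It assumes a simple model in which the motion vector of a block centred at position p (measured from the image centre) is v = (z - 1)·p + (x, y). The function name, the argument layout and the sign conventions are assumptions made for this sketch only.

```python
def estimate_global_motion(positions, vectors):
    """Least-squares fit of v = (z - 1) * p + (x, y) over all blocks.

    positions: list of (px, py) block centres relative to the image centre.
    vectors:   list of (vx, vy) forward block motion vectors, same order.
    Returns (x, y, z): pan, tilt and zoom estimates (hypothetical conventions).
    """
    n = len(positions)
    if n == 0 or n != len(vectors):
        raise ValueError("need one motion vector per block position")

    # Means of the block positions and of the motion vectors.
    mean_px = sum(p[0] for p in positions) / n
    mean_py = sum(p[1] for p in positions) / n
    mean_vx = sum(v[0] for v in vectors) / n
    mean_vy = sum(v[1] for v in vectors) / n

    # Slope a = (z - 1) from the centred data.
    num = 0.0
    den = 0.0
    for (px, py), (vx, vy) in zip(positions, vectors):
        dpx, dpy = px - mean_px, py - mean_py
        num += dpx * (vx - mean_vx) + dpy * (vy - mean_vy)
        den += dpx * dpx + dpy * dpy

    a = num / den if den > 0 else 0.0
    x = mean_vx - a * mean_px  # horizontal translation (pan)
    y = mean_vy - a * mean_py  # vertical translation (tilt)
    return x, y, 1.0 + a
```

Calling this function once per available frame and collecting the results in frame order would give three sequences playing the role of x(t), y(t) and z(t).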

Turning now to FIG. 5A, there is shown a graph of the (pan) generated global motion signal x(t) for an exemplary video sequence. The horizontal axis represents the number of frames of the video sequence from frame number zero to frame number 260. The vertical axis is the pan parameter x varying from −10 pixels to +20 pixels. It has been found that a pan parameter x=−10 to +10 is generally indicative of little pan movement. A pan parameter x of over 10 is indicative of a pan left movement. A pan parameter x of less than −10 is indicative of a pan right movement. As can be seen, there is a left pan movement from about frame number 130 to 230 during the video sequence.

Turning now to FIG. 5B, there is shown a graph of the (tilt) generated global motion signal y(t) for the same exemplary video sequence used in FIG. 5A. Similar to FIG. 5A, the horizontal axis represents the number of frames of the video sequence from frame number zero to frame number 260. The vertical axis is the tilt parameter y varying from −20 pixels to +20 pixels. It has been found that a tilt parameter y=−10 to +10 is generally indicative of little tilt movement. A tilt parameter y of over 10 is indicative of a tilt up movement. A tilt parameter y of less than −10 is indicative of a tilt down movement. As can be seen, there are short tilt movements at about frame numbers 60, 125, 150, 160, 220 and 240 during the video sequence.

Turning now to FIG. 5C, there is shown a graph of the (zoom) generated global motion signal z(t) for the same exemplary video sequence used in FIG. 5A. Similar to FIG. 5A, the horizontal axis represents the number of frames of the video sequence from frame number zero to frame number 260. The vertical axis is the zoom parameter z varying from 0 to 1.2 zoom factor. It has been found that a zoom parameter z=0.98 to 1.02 is generally indicative of little zoom movement. A zoom parameter z of over 1.02 is indicative of zoom out movement and a zoom parameter z of less than 0.98 is indicative of zoom in movement. As can be seen, there is a zoom in movement from about frame number 40 to 140 during the video sequence.

Turning now to FIG. 3, there is shown in more detail a flow diagram of the sub-steps of step 108 of the method shown in FIG. 1. After the key frame extraction method computes 206 the global motion signals, the method then proceeds to step 302. During step 302, the method thresholds each of the global motion signals. That is, the key frame extraction method converts each global parameter of each available frame to one of three discrete global parameter values +1, 0, −1. For example, in the case of the global pan parameter x, if −10<=x<=+10 convert x to 0; otherwise if x>10 convert x to +1, otherwise if x<−10 convert x to −1. The global tilt parameter y is treated similarly. In the case of the global zoom parameter z, if 0.98<=z<=1.02 convert z to 0; otherwise if z<0.98 convert z to −1, otherwise if z>1.02 convert z to +1.
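
The thresholding of step 302 can be sketched as follows, using the limits stated above (±10 pixels for pan and tilt, 0.98 and 1.02 for zoom). The function names and the default arguments are illustrative assumptions and do not appear in the embodiment.

```python
def threshold_pan_or_tilt(value, limit=10.0):
    """Map a pan or tilt parameter to +1, 0 or -1."""
    if value > limit:
        return 1
    if value < -limit:
        return -1
    return 0  # -limit <= value <= +limit: little pan or tilt movement


def threshold_zoom(z, low=0.98, high=1.02):
    """Map a zoom parameter to +1 (z > high), -1 (z < low) or 0."""
    if z > high:
        return 1
    if z < low:
        return -1
    return 0  # low <= z <= high: little zoom movement


def threshold_signals(x_t, y_t, z_t):
    """Threshold the three global motion signals frame by frame."""
    return (
        [threshold_pan_or_tilt(x) for x in x_t],
        [threshold_pan_or_tilt(y) for y in y_t],
        [threshold_zoom(z) for z in z_t],
    )
```

For instance, `threshold_signals([0.0, 15.0], [-12.0, 3.0], [1.0, 0.9])` would give `([0, 1], [-1, 0], [0, -1])`.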

After the thresholding step 302, the method continues to step 304, where noise is removed from the discrete global motion signals. The noise is removed from the discrete global motion signals using known techniques of morphological processing where the discrete global parameters are reduced to a more revealing shape. This technique removes short transient spikes and fills in any holes in the discrete global motion signals. See “Fundamentals of Digital Image Processing” by A. K. Jain, page 384, which describes the basic operations of morphological processing. After step 304 the method continues to step 306.
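
One way to realise the spike removal and hole filling of step 304 is a one-dimensional morphological opening followed by a closing. The sketch below applies both to the +1 and −1 parts of a discrete signal separately. The structuring element (a flat window of `2 * radius + 1` frames), the border handling and the function names are assumptions of this sketch; the embodiment refers only to known morphological techniques.

```python
def _erode(mask, radius):
    # A sample survives only if every sample in its (clipped) window is set.
    n = len(mask)
    return [all(mask[max(0, i - radius):min(n, i + radius + 1)]) for i in range(n)]


def _dilate(mask, radius):
    # A sample is set if any sample in its (clipped) window is set.
    n = len(mask)
    return [any(mask[max(0, i - radius):min(n, i + radius + 1)]) for i in range(n)]


def _open_then_close(mask, radius):
    opened = _dilate(_erode(mask, radius), radius)   # removes short spikes
    return _erode(_dilate(opened, radius), radius)   # fills short holes


def denoise_discrete_signal(signal, radius=1):
    """Clean a signal whose samples are +1, 0 or -1."""
    pos = _open_then_close([v == 1 for v in signal], radius)
    neg = _open_then_close([v == -1 for v in signal], radius)
    cleaned = []
    for original, is_pos, is_neg in zip(signal, pos, neg):
        if is_pos and not is_neg:
            cleaned.append(1)
        elif is_neg and not is_pos:
            cleaned.append(-1)
        elif is_pos and is_neg:
            cleaned.append(original)  # ambiguous sample: keep the input value
        else:
            cleaned.append(0)
    return cleaned
```

With `radius=1`, an isolated one-frame spike is removed and a one-frame gap inside a longer run is filled; a larger radius suppresses longer transients at the cost of also discarding short genuine motions.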

During step 306, the key frame extraction method generates dominant global direction clusters based on the noise reduced discrete global motion signals over the video sequence. The method takes as input all three noise reduced discrete global motion signals for each available frame and determines the dominant direction of the global motion signal for that frame. The dominant direction for a frame can be any one of pan left, pan right, tilt up, tilt down, zoom in, zoom out and global still. The dominant directions of all available frames are then clustered together to form the dominant global direction clusters for the whole video sequence. For example, the ‘global still’ motion captures a still camera and/or small local object motion, that is, all the discrete global motion values for that frame are close to or equal to zero. In the example of a pan left motion, the discrete global motion values for a frame are (pan=1, zoom=0, tilt=0). If, however, combined motions exist for a frame (e.g. discrete global signals pan=1, zoom=1, tilt=0), then the dominant direction of the global motion is the largest one of the three original global motion signals computed during step 206. In the latter case, a comparison is made to determine the largest original global motion signal in a frame. Preferably, the original global motion signals are first averaged over time to remove any transients prior to the comparison. As can be seen from FIGS. 5A to 5C, the zoom global parameter has a different metric than the tilt or pan global parameters. The zoom global motion signal may be normalised so that a direct comparison can be made with the tilt or pan global motion parameters in order to determine the dominant direction of the global motion. Once the dominant direction of the global motion for each frame is determined, these dominant directions may be grouped together to form clusters. As mentioned previously, these are called herein dominant global direction clusters. The clustering sub-step groups consecutive and closely spaced clusters of the same type. Very short motion segments are ignored.
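
The per-frame decision and the clustering of step 306 can be sketched as below. The embodiment does not state how the zoom signal is normalised or how short a segment must be to be ignored, so the `strength` values (assumed to be already normalised to a common scale by the caller) and the `min_length` parameter are hypothetical. The label mapping follows the thresholds given earlier: pan +1 is pan left, tilt +1 is tilt up, zoom +1 is zoom out.

```python
LABELS = {
    ("pan", 1): "pan left", ("pan", -1): "pan right",
    ("tilt", 1): "tilt up", ("tilt", -1): "tilt down",
    ("zoom", 1): "zoom out", ("zoom", -1): "zoom in",
}


def dominant_direction(discrete, strength):
    """discrete, strength: dicts keyed by 'pan', 'tilt' and 'zoom' for one frame."""
    active = [k for k in ("pan", "tilt", "zoom") if discrete[k] != 0]
    if not active:
        return "global still"
    # Combined motion: pick the active signal with the largest normalised strength.
    best = max(active, key=lambda k: strength[k])
    return LABELS[(best, discrete[best])]


def cluster_directions(labels, min_length=10):
    """Group per-frame labels into (label, first_frame, last_frame) clusters."""
    runs = []
    for i, label in enumerate(labels):
        if runs and runs[-1][0] == label:
            runs[-1][2] = i
        else:
            runs.append([label, i, i])

    clusters = []
    for label, start, end in runs:
        if end - start + 1 < min_length:
            continue  # very short motion segments are ignored
        if clusters and clusters[-1][0] == label:
            # Same type as the previous kept cluster: merge across the gap.
            clusters[-1] = (label, clusters[-1][1], end)
        else:
            clusters.append((label, start, end))
    return clusters
```

In this sketch a short tilt segment that interrupts a long zoom in run is dropped, and the two zoom in runs on either side of it are merged into a single cluster.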

Turning now to FIG. 5D, there is shown a graph of the dominant global direction clusters as a function of time (viz. frame number) of the same exemplary video sequence used in FIG. 5A. As can be seen, there is a global still cluster from frame number zero to 40. During this time, there is no dominant global motion and the video sequence is relatively stationary. There is a zoom in cluster during frame numbers 40 to 130. During this time the dominant global motion is a zoom in. Following the zoom in cluster, there is a pan left cluster during frame numbers 130 to 230. During this time the dominant global motion is a pan left. The last cluster is a global still cluster from frame numbers 230 to 260. During this period, there is no dominant global motion.

Returning now to FIGS. 5A to 5C, it can be seen that the pan left movement and zoom in movement overlap during the period from frame numbers 130 to 140. The key frame extraction method determines that the dominant global motion during this period is pan left. It can also be seen that the tilt movements at about frame numbers 60, 125, 150, 160, and 220 overlap both the pan left and zoom in movements. However, the tilt motion segments are short and the key frame extraction method ignores them during the clustering.

Turning now to FIG. 4, there is shown a flow diagram of the sub-steps of step 110 of the key frame extraction method of FIG. 1. After the dominant global direction clusters have been determined 306, the method proceeds to step 402. During this step 402, potential key frames are selected from the dominant global direction clusters. They are not extracted from MPEG2 compressed video at this stage. They are selected in accordance with the following predefined set of rules:

-   One or more frames are selected for a pan or tilt cluster depending on the length and speed of the pan or tilt.
-   One or more frames are selected for a zoom cluster depending on the zoom factor and length of the zoom.
-   Only one frame is selected for each global still cluster.

For example, the potential key frames selected from a pan cluster can be at the start of the pan, the middle of the pan and the end of the pan, or there can be only one potential key frame at the start of the pan, depending on the length and speed of the pan cluster.

Returning now to FIG. 5(d), there are shown the potential key frames selected for the video sequence used in FIGS. 5(a) to (c). In this example, there is one key frame selected for the first global still cluster; two key frames selected for the zoom in cluster, at the beginning and end of the zoom in cluster; two key frames selected at the beginning and end of the pan left cluster; and one key frame selected for the last global still cluster.
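The selection rules of step 402 can be sketched as follows. This is an illustrative simplification rather than the rule set of the embodiment: the function name and the two length thresholds are hypothetical, only cluster length is considered (the pan or tilt speed and the zoom factor are ignored), and taking the middle frame of a global still cluster is an assumption, since the text only states that one frame is selected.

```python
def select_potential_key_frames(clusters, long_len=50, very_long_len=200):
    """Pick candidate key frames for each dominant global direction cluster.

    clusters: list of (label, start, end) tuples, end exclusive.
    Returns a list of (frame_number, cluster_index, position) tuples,
    where position is "start", "middle" or "end".
    Both thresholds are hypothetical values chosen for illustration.
    """
    selected = []
    for index, (label, start, end) in enumerate(clusters):
        length = end - start
        last = end - 1
        middle = start + length // 2
        if label == "global still":
            # Only one frame per global still cluster (middle is an assumption).
            positions = [(middle, "middle")]
        elif length >= very_long_len:
            positions = [(start, "start"), (middle, "middle"), (last, "end")]
        elif length >= long_len:
            positions = [(start, "start"), (last, "end")]
        else:
            positions = [(start, "start")]
        selected.extend((frame, index, pos) for frame, pos in positions)
    return selected

# Clusters of the exemplary sequence of FIG. 5D, as described in the text.
fig_5d_clusters = [
    ("global still", 0, 40),
    ("zoom in", 40, 130),
    ("pan left", 130, 230),
    ("global still", 230, 260),
]
candidates = select_potential_key_frames(fig_5d_clusters)
```

With these assumed thresholds the sketch selects six candidates for the exemplary clusters: one for each global still cluster and two each for the zoom in and pan left clusters, in line with the example described above.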

After step 402, the method proceeds to step 404. In step 404, redundant key frames of the selected key frames 402 are removed based on a set of predefined heuristic rules that take into account photographic habits. For instance, a global still cluster is more important than other motion clusters. It often captures the user's interest or some particular event. It also sometimes yields more sharply focused images than clusters in which the camera is moving. Some of the heuristic rules that may be applied are:

-   A potential key frame selected at the beginning/end of a cluster which is not a “global still” is removed if it follows/is followed by a “global still” cluster. For example, a potential key frame selected at the end of a “zoom in” cluster is removed when it is followed by a “global still” cluster. The key frame selected from a “global still” cluster often has better quality than the one selected from the end of the “zoom in” cluster, and the two are similar.
-   A potential key frame selected at the beginning of a pan/tilt is removed if it follows a zoom cluster. For example, a “zoom in” cluster is followed by a very short ignored motion segment and then followed by a “pan right” cluster. The potential key frame from the beginning of the “pan right” is removed.

However, the set of heuristic rules is not limited to these rules.
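The two heuristic rules listed above can be expressed as a short Python sketch. The function name, the tuple layout of the candidates, and the frame numbers in the usage example are hypothetical; only the cluster order and the two rules are taken from the description.

```python
PAN_TILT = {"pan left", "pan right", "tilt up", "tilt down"}
ZOOM = {"zoom in", "zoom out"}
STILL = "global still"

def remove_redundant(candidates, cluster_labels):
    """Apply the two example heuristic rules to the potential key frames.

    candidates: list of (frame_number, cluster_index, position) tuples,
                position being "start", "middle" or "end".
    cluster_labels: cluster labels in temporal order.
    """
    kept = []
    for frame, index, position in candidates:
        label = cluster_labels[index]
        before = cluster_labels[index - 1] if index > 0 else None
        after = cluster_labels[index + 1] if index + 1 < len(cluster_labels) else None
        # Rule 1: a frame at the edge of a moving cluster that borders a
        # global still cluster is redundant.
        if label != STILL:
            if position == "start" and before == STILL:
                continue
            if position == "end" and after == STILL:
                continue
        # Rule 2: the first frame of a pan/tilt that follows a zoom is redundant.
        if label in PAN_TILT and position == "start" and before in ZOOM:
            continue
        kept.append((frame, index, position))
    return kept

# Cluster order of the exemplary sequence; frame numbers are illustrative.
labels = ["global still", "zoom in", "pan left", "global still"]
potential = [
    (20, 0, "middle"),
    (40, 1, "start"), (129, 1, "end"),
    (130, 2, "start"), (229, 2, "end"),
    (245, 3, "middle"),
]
remaining = remove_redundant(potential, labels)
```

For this input the sketch drops the second, fourth and fifth candidates, leaving one frame from each global still cluster and the frame at the end of the zoom in cluster.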

The method then proceeds to step 406, where the method removes similar and/or repeated key frames of the selected key frames remaining after step 404. Notwithstanding the use of the predefined heuristic rules, scenes are still sometimes repeated, and similar key frames may occur at different times. This step 406 removes these similar key frames by using an image similarity measurement. Existing methods of measuring image similarity (e.g. colour histogram comparison) can be used. In computing the colour histograms, the closest I frames in the MPEG2 video can be used as the key frames instead of P or B frames. The DC coefficients of the MPEG2 compressed image can then be used to generate a low-resolution image. The image similarity measure can be performed using the DC key frame images.
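A colour histogram comparison of the kind mentioned for step 406 can be sketched as below. The description does not specify the histogram layout, the similarity measure or the threshold, so the joint RGB histogram, the histogram intersection measure, the `threshold` value and all function names here are assumptions. Each image is taken to be a non-empty list of (r, g, b) pixels with values 0 to 255, standing in for the low-resolution DC image.

```python
def colour_histogram(pixels, bins=4):
    """Normalised joint RGB histogram of a non-empty list of (r, g, b) pixels."""
    hist = [0.0] * (bins ** 3)
    for r, g, b in pixels:
        # Quantise each 0..255 channel value into one of `bins` levels.
        ri, gi, bi = r * bins // 256, g * bins // 256, b * bins // 256
        hist[(ri * bins + gi) * bins + bi] += 1.0
    total = float(len(pixels))
    return [count / total for count in hist]

def intersection(h1, h2):
    """Histogram intersection: 1.0 for identical histograms, 0.0 for disjoint ones."""
    return sum(min(a, b) for a, b in zip(h1, h2))

def remove_similar(frames, threshold=0.9):
    """Keep the first key frame of every group of similar key frames.

    frames: list of (frame_number, pixels) in temporal order.
    threshold: hypothetical similarity level above which frames count as repeats.
    Returns the frame numbers of the key frames that remain.
    """
    kept = []  # (frame_number, histogram) of the frames kept so far
    for number, pixels in frames:
        hist = colour_histogram(pixels)
        if all(intersection(hist, other) < threshold for _, other in kept):
            kept.append((number, hist))
    return [number for number, _ in kept]
```

Comparing each candidate against all frames already kept, rather than only against its neighbour, is what allows a scene repeated later in the sequence to be detected.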

After step 406, the method proceeds to step 408, where the selected key frames still remaining from step 406 are fully decoded from the MPEG2 video sequence. These decoded remaining key frames form a summary of the MPEG2 video sequence and are output 112 (e.g. stored on hard disk). The method then terminates at step 114. In this way, the method performs a fast and efficient key frame extraction.

Returning now to FIG. 5D, there are shown the extracted key frames for the video sequence used in FIGS. 5A to 5C. During step 404, the second, fourth and fifth potential key frames have been removed using the above set of heuristic rules, leaving the extracted key frames as shown. The removal step 406 in this example does not find any similar or repeated key frames and leaves the extracted key frames as shown. The method then decodes these extracted key frames to form the video summary of the video sequence.

Preferred Embodiment of Apparatus

The method of extracting key frames is preferably practiced using a conventional general-purpose computer system 600, such as that shown in FIG. 6, wherein the processes of FIGS. 1 to 4 may be implemented as software, such as an application program executing within the computer system 600. In particular, the steps of the method of extracting key frames are effected by instructions coded as software that are carried out by the computer. The software may be divided into two separate parts: one part for carrying out the key frame extraction methods, and another part to manage the user interface between the latter and the user. The software may be stored in a computer readable medium, including the storage devices described below, for example. The software is loaded into the computer from the computer readable medium, and then executed by the computer. A computer readable medium having such software or computer program recorded on it is a computer program product. The use of the computer program product in the computer preferably effects an advantageous apparatus for extracting key frames in accordance with the embodiments of the invention.

The computer system 600 comprises a computer module 601, input devices such as a keyboard 602 and mouse 603, and output devices including a printer 615 and a display device 614. A Modulator-Demodulator (Modem) transceiver device 616 is used by the computer module 601 for communicating to and from a communications network 620, for example connectable via a telephone line 621 or other functional medium. The modem 616 can be used to obtain access to the Internet, and other network systems, such as a Local Area Network (LAN) or a Wide Area Network (WAN).

The computer module 601 typically includes at least one processor unit 605, a memory unit 606, for example formed from semiconductor random access memory (RAM) and read only memory (ROM), input/output (I/O) interfaces including a video interface 607, and an I/O interface 613 for the keyboard 602 and mouse 603 and optionally a joystick (not illustrated), and an interface 608 for the modem 616. A storage device 609 is provided and typically includes a hard disk drive 610 and a floppy disk drive 611. A magnetic tape drive (not illustrated) may also be used. A CD-ROM drive or DVD drive 612 is typically provided as a non-volatile source of data. The components 605 to 613 of the computer module 601 typically communicate via an interconnected bus 604 and in a manner which results in a conventional mode of operation of the computer system 600 known to those in the relevant art. Examples of computers on which the embodiments can be practised include IBM-PCs and compatibles, Sun Sparcstations, or like computer systems evolved therefrom.

Typically, the application program of the preferred embodiment is resident on the hard disk drive 610 and is read and controlled in its execution by the processor 605. Intermediate storage of the program and any data fetched from the network 620 may be accomplished using the semiconductor memory 606, possibly in concert with the hard disk drive 610. In some instances, the application program may be supplied to the user encoded on a CD-ROM or floppy disk and read via the corresponding drive 612 or 611, or alternatively may be read by the user from the network 620 via the modem device 616. Still further, the software can also be loaded into the computer system 600 from other computer readable media including magnetic tape, a ROM or integrated circuit, a magneto-optical disk, a radio or infra-red transmission channel between the computer module 601 and another device, a computer readable card such as a PCMCIA card, and the Internet and Intranets including email transmissions and information recorded on websites and the like. The foregoing is merely exemplary of relevant computer readable media. Other computer readable media may be used without departing from the scope and spirit of the invention.

The computer system 600 has the capability to store large amounts of video data, which serves as input to the key frame extraction method. The video data may be input to the computer system 600 via a DVD-ROM drive 612 or directly via a camcorder (not shown) via input 608.

The method of extracting key frames may alternatively be implemented in dedicated hardware such as one or more integrated circuits performing the functions or sub functions of FIG. 1. Such dedicated hardware may be incorporated in a camcorder or VCR or such like, and may include graphic processors, digital signal processors, or one or more microprocessors and associated memories.

INDUSTRIAL APPLICABILITY

It is apparent from the above that the embodiment(s) of the invention are applicable to the video processing industries. The key frame extraction method has many applications, amongst which are: visual identification of video content; video indexing; video browsing; and video editing.

The foregoing describes only one embodiment/some embodiments of the present invention, and modifications and/or changes can be made thereto without departing from the scope and spirit of the invention, the embodiment(s) being illustrative and not restrictive.

1.-7. (canceled)
8. A method of generating a representation of a video sequence, wherein said video sequence comprises compressed video data having block motion vectors, the method comprising the steps of: decompressing the compressed video data to obtain said block motion vectors; converting said block motion vectors to forward block motion vectors; generating global motion signals based on the forward block motion vectors; generating dominant global direction clusters based on said generated global motion signals; selecting potential key frames of the video sequence using said generated dominant global direction clusters and a set of predefined rules; removing redundant key frames from said selected potential key frames resulting in remaining selected key frames; and decompressing said remaining selected key frames to obtain said representation of the video sequence.
9. A method as claimed in claim 8, wherein said step of generating global motion signals comprises generating a pan global motion signal, a zoom global motion signal, and a tilt global motion signal.
10. A method as claimed in claim 8, wherein said dominant global direction clusters comprise one or more of a pan left, pan right, tilt up, tilt down, zoom in, zoom out and global still motion cluster.
11. A method as claimed in claim 8, wherein said step of generating dominant global direction clusters comprises the sub-steps of: generating discrete global motion signals from said generated global motion signals; removing noise from said generated discrete global motion signals; and generating dominant global direction clusters based on said noise reduced discrete global motion signals.
12. A method of extracting key frames from one or more video clips, wherein each said video clip comprises MPEG compressed video data having block motion vectors, the method comprising the steps of: partially decompressing the MPEG compressed video data to obtain said block motion vectors; converting said block motion vectors to forward block motion vectors; generating a pan global motion signal, a zoom global motion signal, and a tilt global motion signal based on the forward block motion vectors; generating dominant global direction clusters based on said pan, tilt, and zoom generated global motion signals, wherein said dominant global direction clusters comprise one or more of a pan left, pan right, tilt up, tilt down, zoom in, zoom out and global still motion cluster; selecting potential key frames of each said video clip using said generated dominant global direction clusters and a set of predefined rules; removing redundant key frames from said selected potential key frames using a predefined set of heuristic rules resulting in a first set of remaining selected key frames; removing similar and/or repeated key frames from said first set of remaining selected key frames using a colour histogram technique resulting in a second set of remaining selected key frames; and decompressing said second set of remaining selected key frames to obtain said extracted key frames.
13. A method as claimed in claim 12, wherein said step of generating dominant global direction clusters comprises the sub-steps of: generating discrete global motion signals from said generated global motion signals; removing noise from said generated discrete global motion signals; and generating dominant global direction clusters based on said noise reduced discrete global motion signals.
14. (canceled)
15. Apparatus for generating a representation of a video sequence, wherein said video sequence comprises compressed video data having block motion vectors, the apparatus comprising: means for decompressing the compressed video data to obtain said block motion vectors; means for converting said block motion vectors to forward block motion vectors; means for generating global motion signals based on the forward block motion vectors; means for generating dominant global direction clusters based on said generated global motion signals; means for selecting potential key frames of the video sequence using said generated dominant global direction clusters and a set of predefined rules; means for removing redundant key frames from said selected potential key frames resulting in remaining selected key frames; and means for decompressing said remaining selected key frames to obtain said representation of the video sequence.
16. Apparatus for extracting key frames from one or more video clips, wherein each said video clip comprises MPEG compressed video data having block motion vectors, the apparatus comprising: means for partially decompressing the MPEG compressed video data to obtain said block motion vectors; means for converting said block motion vectors to forward block motion vectors; means for generating a pan global motion signal, a zoom global motion signal, and a tilt global motion signal based on the forward block motion vectors; means for generating dominant global direction clusters based on said pan, tilt, and zoom generated global motion signals, wherein said dominant global direction clusters comprise one or more of a pan left, pan right, tilt up, tilt down, zoom in, zoom out and global still motion cluster; means for selecting potential key frames of each said video clip using said generated dominant global direction clusters and a set of predefined rules; means for removing redundant key frames from said selected potential key frames using a predefined set of heuristic rules resulting in a first set of remaining selected key frames; means for removing similar and/or repeated key frames from said first set of remaining selected key frames using a colour histogram technique resulting in a second set of remaining selected key frames; and means for decompressing said second set of remaining selected key frames to obtain said extracted key frames.
 17. (canceled)
18. A computer readable medium comprising a computer program for generating a representation of a video sequence, wherein said video sequence comprises compressed video data having block motion vectors, the computer program comprising: code for decompressing the compressed video data to obtain said block motion vectors; code for converting said block motion vectors to forward block motion vectors; code for generating global motion signals based on the forward block motion vectors; code for generating dominant global direction clusters based on said generated global motion signals; code for selecting potential key frames of the video sequence using said generated dominant global direction clusters and a set of predefined rules; code for removing redundant key frames from said selected potential key frames resulting in remaining selected key frames; and code for decompressing said remaining selected key frames to obtain said representation of the video sequence.
19. A computer readable medium comprising a computer program for extracting key frames from one or more video clips, wherein each said video clip comprises MPEG compressed video data having block motion vectors, the computer program comprising: code for partially decompressing the MPEG compressed video data to obtain said block motion vectors; code for converting said block motion vectors to forward block motion vectors; code for generating a pan global motion signal, a zoom global motion signal, and a tilt global motion signal based on the forward block motion vectors; code for generating dominant global direction clusters based on said pan, tilt, and zoom generated global motion signals, wherein said dominant global direction clusters comprise one or more of a pan left, pan right, tilt up, tilt down, zoom in, zoom out and global still motion cluster; code for selecting potential key frames of each said video clip using said generated dominant global direction clusters and a set of predefined rules; code for removing redundant key frames from said selected potential key frames using a predefined set of heuristic rules resulting in a first set of remaining selected key frames; code for removing similar and/or repeated key frames from said first set of remaining selected key frames using a colour histogram technique resulting in a second set of remaining selected key frames; and code for decompressing said second set of remaining selected key frames to obtain said extracted key frames.